Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Make dx histogram behavior consistent with px #1002

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

jnumainville
Copy link
Collaborator

Fixes: #668

The following examples are now basically consistent.
During testing I discovered #1001 so the legend titles are not set in a couple of these, but that is beyond the scope of this ticket as that is a general plot by issue, not a histogram specific one.

Note a few intentional differences in the args. It's not expected that dx histograms and px histograms aggregate in the same exact way. Plotly infers the equivalent of range_bins hence the extra argument. Additionally we have barmode="group" as our default whereas in Plotly it is barmode="relative". I believe this was an intentional choice on our end because it's hard to compare the different traces with barmode="relative" although I don't have a record of it.

import plotly.express as px
import deephaven.plot.express as dx

df = px.data.tips()
fig = px.histogram(df, x="size", y="total_bill")
fig2 = px.histogram(df, x="size", y="total_bill", histfunc="count", nbins=6)
fig3 = px.histogram(df, x="size", y="total_bill", orientation="h", nbins=6)
fig4 = px.histogram(df, x="size", y="total_bill", orientation="h", histfunc="count", nbins=6)
fig5 = px.histogram(df, x=["tip", "total_bill"], y="size", nbins=6)
fig6 = px.histogram(df, x=["tip", "total_bill"], y="size", orientation="h", nbins=6)
fig7 = px.histogram(df, x="size", y=["tip", "total_bill"], nbins=6)
fig8 = px.histogram(df, x="size", y=["tip", "total_bill"], orientation="h", nbins=6)
fig9 = px.histogram(df, x=["tip", "total_bill"], y="size", histfunc="count", nbins=6)
fig10 = px.histogram(df, x=["tip", "total_bill"], y="size", orientation="h", histfunc="count")
fig11 = px.histogram(df, x="size", y=["tip", "total_bill"], histfunc="count", nbins=6)
fig12 = px.histogram(df, x="size", y=["tip", "total_bill"], orientation="h", histfunc="count", nbins=6)
fig13 = px.histogram(df, x="size", y=["tip", "total_bill"], histnorm="probability", barnorm="percent")
fig14 = px.histogram(df, x="size")
fig15 = px.histogram(df, y="size")

fig1d = dx.histogram(df, x="size", y="total_bill", nbins=6, range_bins=[0, 7])
fig2d = dx.histogram(df, x="size", y="total_bill", histfunc="count", nbins=6, range_bins=[0, 7])
fig3d = dx.histogram(df, x="size", y="total_bill", orientation="h", nbins=6, range_bins=[0, 60])
fig4d = dx.histogram(df, x="size", y="total_bill", orientation="h", histfunc="count", nbins=6, range_bins=[0, 60])
fig5d = dx.histogram(df, x=["tip", "total_bill"], y="size", nbins=6, range_bins=[0, 60], barmode="relative")
fig6d = dx.histogram(df, x=["tip", "total_bill"], y="size", orientation="h", nbins=6, range_bins=[0, 7], barmode="relative")
fig7d = dx.histogram(df, x="size", y=["tip", "total_bill"], nbins=6, range_bins=[0, 7], barmode="relative")
fig8d = dx.histogram(df, x="size", y=["tip", "total_bill"], orientation="h", nbins=6, range_bins=[0, 60], barmode="relative")
fig9d = dx.histogram(df, x=["tip", "total_bill"], y="size", histfunc="count", nbins=6, range_bins=[0, 60], barmode="relative")
fig10d = dx.histogram(df, x=["tip", "total_bill"], y="size", orientation="h", histfunc="count", nbins=6, range_bins=[0, 7], barmode="relative")
fig11d = dx.histogram(df, x="size", y=["tip", "total_bill"], histfunc="count", nbins=6, range_bins=[0, 7], barmode="relative")
fig12d = dx.histogram(df, x="size", y=["tip", "total_bill"], orientation="h", histfunc="count", nbins=6, range_bins=[0, 60], barmode="relative")
fig13d = dx.histogram(df, x="size", y=["tip", "total_bill"], histnorm="probability", barnorm="percent", range_bins=[0,7], nbins=6, barmode="relative")
fig14d = dx.histogram(df, x="size", nbins=6)
fig15d = dx.histogram(df, y="size", nbins=6)

@jnumainville jnumainville requested review from a team and mattrunyon and removed request for a team November 8, 2024 16:24
margaretkennedy
margaretkennedy previously approved these changes Nov 8, 2024
@@ -47,6 +47,27 @@ hist_3_bins = dx.histogram(setosa, x="SepalLength", nbins=3)
hist_8_bins = dx.histogram(setosa, x="SepalLength", nbins=8)
```

### Bin and aggregate on different columns

If both `x` and `y` are specified, the histogram will be binned across one column and aggregated on the other.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would mention something like "If the plot orientation is vertical, the x column will be binned and the y column aggregated. The operations are flipped if the plot orientation is horizontal."

I don't know which orientation corresponds with which pairing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 898 to 899
# only one should be set
if hist_agg_label_h:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a bit clearer to make the args hist_agg_label and hist_orientation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +346 to +350
Column values must be numeric. If x is specified,
the bars are drawn vertically by default.
y: A column name or list of columns that contain y-axis values.
Only one of x or y can be specified. If y is specified, the
bars are drawn vertically.
Column values must be numeric. If only y is specified,
the bars are drawn horizontally by default.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just making sure this has different behavior from bar in that the columns must be numeric. From the bar orientation docstring it looks like 1 value can be non-numeric

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's intentional. Our histogram never allows non-numeric data. Our bar does. It's worth noting that the logic in calculate_bar_orientation isn't called by bar when both x and y are specified, only when x or y is (as that is the logic that was formally known as frequency_bar) so the orientation in that case is set by the wrapped px.bar function.

Comment on lines +162 to +164
relabeled_agg_col = (
labels.get(self.agg_col, self.agg_col) if labels else self.agg_col
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ternary necessary here? Since there's a default empty dict for labels, I think you can just get rid of the ternary since the else should never hit? I guess unless a user specified labels=None maybe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's for when labels=None

Comment on lines +185 to +188
elif self.histnorm == "probability":
hist_agg_label = f"fraction of sum of {hist_agg_label}"
elif self.histnorm == "percent":
hist_agg_label = f"percent of sum of {hist_agg_label}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these X of sum of Y? Looks like this is the case where histfunc is not count or sum. So is there a histfunc that should be required for these cases?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how plotly labels it https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/express/_core.py#L239
More specifically though, hist_agg_label would contain histfunc at this point.

Comment on lines 6 to 10
class UnivariateAwarePreprocessor:
"""
A preprocessor that stores useful args for plots where possibly one of x or y or both can be specified,
which impacts the orientation of the plot in ways that affect the preprocessing.
Should be inherited from.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be abstract?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 14 to 15
pivot_vars: The vars with new column names if a list was passed in
list_var: The var that was passed in as a list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by what these do. Is pivot_vars column renames? Is list_var supposed to be something like x or y?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pivot_vars is kind of a rename - it's for when you pass in a list so the tables need to be stacked with resulting value and variable columns, although they may or may not be named exactly that if those columns already exist in the table. It's not a great name for that dict. I think I'll go ahead and refactor it as part of this.
list_var would be x or y yes, whichever parameter was a list originally

Comment on lines 49 to 58
if self.args.get(self.agg_var):
self.agg_col: str = (
pivot_vars["value"]
if pivot_vars and list_var and list_var == self.agg_var
else args[self.agg_var]
)
else:
# if bar_var is not set, the value column is the same as the axis column
# because both the axis bins and value are computed from the same inputs
self.agg_col = self.bin_col
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the comment in the else say agg_var? The if only checks agg_var, not bar_var

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

x and y behavior in dx.histogram is not fully consistent with px.histogram
3 participants